-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Crystal::Hasher for computing a hash
value
#4946
Conversation
This implements the code organization part of #4675 but not the actual "good" implementation. But this PR is simpler to review. |
Also fixes #4578 |
@asterite I don't think "fixes #foo" outside the main PR body actually gets github to close the issues (you can actually see on the issue page if a PR will close it now, which is cool) |
@RX14 Thanks, updated! |
src/crystal/hasher.cr
Outdated
end | ||
|
||
def float(value) | ||
@value = @value * 31 + value.to_f64.unsafe_as(UInt64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use this on the floats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave it for a next PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a big todo at the top of the struct. On fact in this PR I could have used zero as a seed, but I wanted to see that it worked with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what you are asking is not exactly covered by the above TODO. Still, I'd like to tackle these issues one at a time. After we merge this, I'll probably wait for someone (you? :-)) to use float normalization, though I'll require a test to see why it's needed. Then another PR can tackle another (better) hash algorithm, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
canonicalize
is OK, but implementation proposed by @funny-falcon (PyHash inspired) is more robust and generic normalization that takes in care canonicalization too. Will propose it as separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how the pattern take advantage of the struct and chain the results. 💙
It's a pity that all the methods in the hasher have a value
argument that hide the getter of the sole ivar @value
. 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, and it avoids pointers. Now I'd love to see halfsiphash and siphash hashing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to add add a simple explanatory doc string to each hash(hasher)
implementation, similar to the one in the example for def_hash
: Define a `hash(hasher)` method based on @name and @age
.
At least all the obsolete documentation should be updated or removed.
src/enum.cr
Outdated
# Returns a hash value. This is the hash of the underlying value. | ||
def hash | ||
value.hash | ||
def hash(hasher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a "See Object#hash(hasher)
to all of these
src/float.cr
Outdated
@@ -86,6 +86,10 @@ struct Float | |||
end | |||
end | |||
|
|||
def hash(hasher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
src/hash.cr
Outdated
# ``` | ||
def hash | ||
hash = size | ||
def hash(hasher) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
src/bool.cr
Outdated
@@ -42,8 +42,8 @@ struct Bool | |||
end | |||
|
|||
# Returns a hash value for this boolean: 0 for `false`, 1 for `true`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fixed hash value should be removed.
src/char.cr
Outdated
@@ -415,8 +415,8 @@ struct Char | |||
end | |||
|
|||
# Returns this char's codepoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc string is no longer correct.
src/xml/namespace.cr
Outdated
def hash | ||
object_id | ||
end | ||
def_hash object_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
src/xml/node.cr
Outdated
def hash | ||
object_id | ||
end | ||
def_hash object_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
src/xml/node_set.cr
Outdated
def hash | ||
object_id | ||
end | ||
def_hash object_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
src/yaml/any.cr
Outdated
def hash | ||
raw.hash | ||
end | ||
def_hash raw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
src/json/any.cr
Outdated
def hash | ||
raw.hash | ||
end | ||
def_hash raw |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing doc
spec/std/float_spec.cr
Outdated
@@ -206,11 +206,13 @@ describe "Float" do | |||
|
|||
describe "hash" do | |||
it "does for Float32" do | |||
1.2_f32.hash.should_not eq(0) | |||
1.2_f32.hash.should eq(1.2_f32.hash) | |||
1.2_f32.hash.should_not eq(1.3_f32.hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this expectation is wrong in common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this then. Thanks
spec/std/float_spec.cr
Outdated
end | ||
|
||
it "does for Float64" do | ||
1.2.hash.should_not eq(0) | ||
1.2.hash.should eq(1.2.hash) | ||
1.2.hash.should_not eq(1.3.hash) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
src/crystal/hasher.cr
Outdated
end | ||
|
||
def float(value) | ||
@value = @value * 31 + value.to_f64.unsafe_as(UInt64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
canonicalize
is OK, but implementation proposed by @funny-falcon (PyHash inspired) is more robust and generic normalization that takes in care canonicalization too. Will propose it as separate PR.
I forced pushed a few fixes (maybe next time I'll add more commits so it's simpler to review) @bcardiff I changed the instance variable name of Hasher to I also changed one spec that would randomly fail. This is when a hash has the values, for example, |
when values are equal, their hashes should be equal too (as documented in Crystal documentation, and it's implemented in Python). This is number normalization that was introduced by #4675 pull request, and will be proposed again later (just grep for |
Number::Normalize returns same hashes for same values. 1.hash, 1.0.hash,
1.to_big_f.hash etc. always be equal with number normalization
пн, 11 сент. 2017 г. в 16:35, Ary Borenszweig <[email protected]>:
… @RX14 <https://github.com/rx14> @akzhan <https://github.com/akzhan> I
don't see how float normalization will solve that particular issue. Care to
explain?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#4946 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAG3hNPGgpVHChbYtKBW_mlJyFDJDy1wks5shTc2gaJpZM4PSeU3>
.
|
The distinction is between float normalization which transforms a float into an unambiguous bitpattern, and number hash normalization which generates a unique hash from a unique numerical value. I guess thats why LLVM uses canonicalization to refer to the former. We should do the same. |
But hash normalization for numbers will make this problem worse, because {1 => 'a', 1.0 => 'b'} will end up being: {1.0 => 'b'} because the second key replaces the first one, they are indistinguishable. One way for that to work is not to use |
@asterite What we were suggesting is exactly that behaviour: you cannot insert two "ones" into a hash - regardless of their representation. |
@RX14 Sounds good for now. I don't think it's consistent with union types, but we can delay this "issue" (if it is one) for later. |
Merged! Feel free to open issues regarding number normalization and using other hashing algorithms, or correct hash values for big numbers. |
In fact, as I understand it, now numbers of two different types like |
# TODO: the implementation is naive and should probably use | ||
# another algorithm like SipHash or FNV. | ||
|
||
@@seed = uninitialized UInt64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@seed = Random::System.rand(UInt64::MIN..UInt64::MAX)
with #4894
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It breaks the compiler at all with compiler bug inside. I don't know why :)
But I'll pull request with Random::Secure.
end | ||
|
||
def float(value) | ||
@result *= @result * 31 + value.to_f64.unsafe_as(UInt64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this *=
intended?
I really dislike that this got merged in only 19 hours. I didn't have the opportunity to see it. |
@oprypin Don't worry, this is not the final code. This just introduces |
The general idea is fine, if not the most elegant. In this case it's more about the small things (please re-check that |
I'm sorry, but @funny-falcon code was the same in terms of 'final', but a lot better. Your things are great, but now we should repeat all of its great proposals. Ok. |
@akzhan Maybe, but I don't think so. That PR adds a It also introduces Anyway, it had a lot more code than what was needed. And it was so big it received many comments until the author got tired. I think it's better to do this by little steps. |
BTW: I ported SipHash and made a streaming version: In terms of performance, computing the hash in one bulk was on par with the official C implementation. Streaming the input adds a ~20% overhead, because the state must be regularly updated. key = uninitialized UInt8[16]
SecureRandom.random_bytes(key.to_slice)
hasher = SipHash64(1, 3).new(key)
hasher.update("some ")
hasher.update("incoming ")
hasher.update("data ")
hasher.final # => UInt64 |
BTW again: I published a shard for SipHash and HalfSipHash: https://github.com/ysbaddaden/siphash.cr |
@ysbaddaden any reason why this isn't a PR to replace the default implementation? I think |
Yes, it may be good proposal (but @funny-falcon one is better for LLVM i suppose). Anyway I don't try to propose any implementation until pushed number normalization. |
And please leave |
This PR is a preparation to better implement the
hash
method for Crystal objects.Instead of asking types to implement
hash
, we now provide a default implementation and ask subclasses to implementhash(hasher)
. A hasher provides a safe way to compute a hash, and can also compute a different value for the same program at different runs to prevent DoS attacks. So this fixes #2817The current implementation lives in
Crystal::Hasher
and it's not public. The idea is not to leak this information. For example in Ruby you can't pass a hasher to methods, and you can't choose a hasher. This is all decided by Ruby's runtime, and hidden, and this is good because it removes the burden from the user.The implementation here is safe(r) because it randomizes the initial hash value. However, I'm sure we could use SipHash or FNV or some other algorithm. But this is just a start. Next PRs should just touch
Crystal::Hasher
(and probablyBigInt
and others because the hash is not correct there).Fixes #2817
Fixes #4578